Introduction to Open Data Science - Week 1, Intro

1.1 I am rather excited to start this course, although I am a bit worried that the workload will be too much for me. I hope, though, that if I manage the workload, I will gain the skills to start independently developing as a user of R and a contributor to open data science. It is especially the latter point that brought me to this course, although I did not actively look for it. I just happened to run into it on the course page of the University of Helsinki.

1.2 In what has been a rather unpleasant experience, GitHub did not immediately work for me. Nothing seemed to upload to my diary. Now, I think, I have managed to overcome my issues and get things to work most of the time. Learning by doing, I suppose!

To summarize:
- I want to learn open data science.
- I am worried the workload will be too much.
- I look forward to the course.
- I hope I will figure out the specifics of Git and RStudio.

Here’s the link to my GitHub repository.


Introduction to Open Data Science - Week 2, Regression

Libraries:

library(dplyr)
library(ggplot2)
library(GGally)

This chapter analyses a selection of data from a 2014 survey of students participating in an introductory statistics course in Finland. The survey mapped students’ learning approaches and learning achievements. While the original data contained 183 observations of 60 variables, a more limited dataset of 166 observations of 7 variables will be employed here. These variables are the age and gender of the participants, their points from the course representing their performance, their attitude towards the course, and three variables mapping their learning styles. These learning styles are the “surface approach,” indicating memorization without deeper engagement; the “deep approach,” indicating an intention to maximize understanding of the subject matter; and the “strategic approach,” indicating an approach aimed at maximizing the students’ chances of a good grade. The variables “attitude,” “surface approach,” “deep approach,” and “strategic approach” are all aggregate mean measures of other variables. As such, each variable summarizes related observations into an average. This analysis used the below script, in combination with existing knowledge, to interpret the dataset:
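The wrangling that produced these aggregate mean variables is not shown in this diary. As an illustration only, such a mean score can be formed with rowMeans; the item column names below are hypothetical, not the actual survey items:

```r
# Hypothetical illustration: an aggregate score such as "deep" as the
# row-wise mean of its related questionnaire items. The item names here
# are invented for the example.
deep_items <- c("D03", "D11", "D19", "D27")
Learn2014$deep <- rowMeans(Learn2014[, deep_items])
```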

# Read the tab-separated data and recode gender as a categorical variable
Learn2014 <- read.table("Data/Learn2014", header = TRUE, sep = "\t")
Learn2014$gender <- factor(Learn2014$gender, levels = c(0, 1), labels = c("0", "1"))
str(Learn2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age   : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ gender: Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 2 ...
##  $ attit : num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep  : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf  : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ strat : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Points: int  25 12 24 10 22 21 21 31 24 26 ...

The below graphs and summaries of the data help us gain an initial picture of the trends present therein. For one, we can see that the vast majority of students participating in this survey were female (110 females vs. 56 males), with a mean age of roughly 25.5 years and approximately 75% of students being 27 or younger.

As for the variables related to studying, all of them approximate a normal distribution, although with a slight skew to the right. Certain immediately interesting pieces of information arise from the correlation numbers. Firstly, a positive attitude is strongly correlated with higher points, while the deep approach, counterintuitively, seems to have little effect on performance. Meanwhile, the surface approach seems to predict slightly worse performance, and the strategic approach slightly better performance. Curiously, age among men seems to predict worse performance, although this might be due to two outliers. We shall next test these initial findings with a multiple linear regression.

Graph_AgeGeN <- ggpairs(Learn2014, columns = c(1, 2), legend = 3,
                        title = "Age and Gender", mapping = aes(col = gender),
                        lower = list(combo = wrap("facethist", bins = 20)))

Graph_AgeGenPoints <- ggpairs(Learn2014, columns = c(1, 2, 7),
                              title = "Effects of Age and Gender on Points",
                              mapping = aes(shape = gender, col = gender),
                              lower = list(combo = wrap("facethist", bins = 20)))

Graph_PredPoints <- ggpairs(Learn2014, columns = 3:7,
                            title = "Attitude, Study Style and Points",
                            mapping = aes(shape = gender, col = gender),
                            lower = list(combo = wrap("facethist", bins = 20)))
Graph_AgeGeN

Graph_AgeGenPoints

Graph_PredPoints

summary(Learn2014)
##       Age        gender      attit            deep            surf      
##  Min.   :17.00   0: 56   Min.   :1.400   Min.   :1.583   Min.   :1.583  
##  1st Qu.:21.00   1:110   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.417  
##  Median :22.00           Median :3.200   Median :3.667   Median :2.833  
##  Mean   :25.51           Mean   :3.143   Mean   :3.680   Mean   :2.787  
##  3rd Qu.:27.00           3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.167  
##  Max.   :55.00           Max.   :5.000   Max.   :4.917   Max.   :4.333  
##      strat           Points     
##  Min.   :1.250   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:19.00  
##  Median :3.188   Median :23.00  
##  Mean   :3.121   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :33.00

For the below multiple linear regression, three predictor variables have been chosen: attitude, the surface approach, and the strategic approach. These variables were chosen due to their relatively high correlations with points compared to the other available variables (age for males is excluded due to the presence of outliers skewing the calculation). The regression shows that only attitude has a statistically significant impact on points, as it is the only independent variable with a p-value below 0.05. In the case of attitude, if the null hypothesis (attitude has no effect on points) were true, there would be a less than 0.1 percent chance of observing a relationship this strong in the data. Not only is attitude a statistically significant predictor of points, it also seems to have a strong impact, with its beta coefficient being approximately 3.4. This means that with each 1-point step towards a better attitude on the Likert scale, points rise by approximately 3.4.

For the remaining predictors, the p-values are above 0.05, the conventional cut-off for statistical significance. This interpretation is also supported by the t-values, which are conventionally expected to be larger than 2, or smaller than -2, to indicate statistical significance. Altogether, this model nevertheless only explains approximately 20% of the variation in points, meaning that it is not a very good predictive model.

Points_regression <- lm(Points ~ attit + strat + surf, data = Learn2014)
summary(Points_regression)
## 
## Call:
## lm(formula = Points ~ attit + strat + surf, data = Learn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attit         3.3952     0.5741   5.913 1.93e-08 ***
## strat         0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

To further test the significance of attitude, the surface approach and strategic approach variables will be removed and a simple linear regression carried out with just attitude as the predictor. This produced no novel results, although with the dropping of variables the explanatory power of the model, multiple R-squared, has gone down from approximately 0.21 to 0.19. This means that changes in students’ attitude can help explain 19% of the variation in students’ scores. The fact that the reduction is so minor is further indication of the minor impact of the surface approach and strategic approach variables. To play around a bit, I have also included a multiple linear regression with age added back in. This, too, had no notable effect on the model; the slight rise in R-squared is to be expected every time a predictor variable is added.

Points_Attit_Reg <- lm(Points ~ attit, data = Learn2014)
summary(Points_Attit_Reg)
## 
## Call:
## lm(formula = Points ~ attit, data = Learn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attit         3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
AgeAttit_Regression <- lm(Points ~ attit + Age, data = Learn2014)
summary(AgeAttit_Regression)
## 
## Call:
## lm(formula = Points ~ attit + Age, data = Learn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3354  -3.3095   0.2625   4.0005  10.4911 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.57244    2.24943   6.034 1.04e-08 ***
## attit        3.54392    0.56553   6.267 3.17e-09 ***
## Age         -0.07813    0.05315  -1.470    0.144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.301 on 163 degrees of freedom
## Multiple R-squared:  0.2011, Adjusted R-squared:  0.1913 
## F-statistic: 20.52 on 2 and 163 DF,  p-value: 1.125e-08

To validate the model, this final section will produce three diagnostic plots to test whether the assumptions of a regression model are met by the data. For this validation, the simple linear regression model Points_Attit_Reg will be used, as it is the most parsimonious of the models produced. The below graphs, “Residuals vs. Fitted,” “Normal Q-Q,” and “Residuals vs. Leverage,” test whether the assumptions of constant variance of errors and normally distributed errors are met, and whether any single observation has an outsized influence on the model.

The Q-Q plot tests whether errors are normally distributed. The below graph shows that the dots fit reasonably on the line, although as we move towards the more extreme quantiles we can see that the distribution shows signs of being leptokurtic and as such might not be normally distributed. Nevertheless, this analysis interprets this distribution of errors as normal.

The Residuals vs. Fitted graph tests the assumption of constant variance of errors by plotting residuals against predicted values. As we can see no discernible pattern in the data, we can interpret the graph as showing no indication that the size of the error depends on the predicted value. Thus, constant variance of errors is established.

The Residuals vs. Leverage graph shows us that none of the data points have an unreasonably high power to pull the model’s predictions toward themselves. This means that there are no influential outliers in the dataset. This, in combination with the above tests, indicates that the model is valid, as it adheres to the integral assumptions of linear regression.

plot(Points_Attit_Reg, which = c(1,2,5))


Introduction to Open Data Science - Week 3, Logistic Regression

Libraries:

library(dplyr)
library(ggplot2)
library(GGally)
library(boot)

2. and 3.
Data Description with Variable Selection and Justification

The below glimpsed dataset, “TheData,” contains the questionnaire answers of 382 students from two Portuguese secondary schools. The answers were given by students attending maths and Portuguese language courses, each group having produced its own dataset; these have here been combined into one. In the process of combining the data, observations have been selected in a manner that ensures that 13 identifying variables contain no empty values. This has resulted in a reduction from 1044 to 382 observations per variable.
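The combining step itself is not shown in this diary. A minimal sketch of how such a join could be done with dplyr, assuming the two course datasets have been read into objects named math and por (the object names and the exact list of identifier columns are my assumptions):

```r
# Sketch (assumed names): keep only students that appear in both the
# maths and Portuguese datasets by joining on identifying background
# variables; columns present in both tables get course-specific suffixes.
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")
TheData <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
```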

The questionnaire was created to predict the target variable “G3,” i.e. the final grade of the student attending the course. Accordingly, the variables can be said to have at least a potential link to school performance, although some (such as whether the student lives in an urban or rural area) arguably have a more tenuous theoretical link to performance than others (such as whether the student receives additional educational support). A glimpse of the data is provided below:

## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F"…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U"…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "L…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T"…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servi…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other",…
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation",…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes"…
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother"…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes"…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "y…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 1…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, …
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0…
## $ alcoholics <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

For the purposes of this analysis, four variables relating to alcohol use have been selected. The primary purpose of this analysis is to examine the effects of alcohol use on the final grade. Accordingly, the variable “G3,” the final grade on a 20-point scale, is a given. As is “alc_use,” the variable mapping alcohol use on a five-point scale where “1” indicates very low consumption and “5” very high consumption (this variable is the mean of the student’s alcohol consumption on weekdays and weekends, mapped by the variables “Dalc” and “Walc,” respectively). The hypothesized relationship between “G3” and “alc_use” is that higher consumption of alcohol predicts lower achievement in school, represented by G3. Furthermore, it is hypothesized that the mechanism that might explain any potential causal relationship is the number of absences (measured in days out of 93) arising from either reduced energy or hangovers caused by heavier drinking (I realize that this is a bold assumption to make before examining the relationship between alcohol use, absences and the final grade, but the task requires naming the four variables now). The benefit of this causal explanation is that it does not require knowledge of the effects of alcohol use on the brain, nor does it demand that the high use be long term - a common qualifier with alcohol-related learning difficulties, but something for which the dataset contains no data.
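The derivation of alc_use, and of the alcoholics indicator visible in the glimpse, is not shown in the diary; a minimal sketch with dplyr, assuming the combined data is in TheData:

```r
# alc_use is the mean of weekday (Dalc) and weekend (Walc) consumption;
# the alcoholics flag marks students whose average use exceeds 2.
TheData <- mutate(TheData,
                  alc_use = (Dalc + Walc) / 2,
                  alcoholics = alc_use > 2)
```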

As the working theory is that higher alcohol use has a negative effect on school performance, it is also useful to theorize about the reasons behind higher alcohol use. Here two variables are examined: “freetime,” i.e. how much free time the student has in a week on a five-point scale (1 denoting very little, 5 very much), and “famrel,” i.e. how good the student’s relationship with their family is on a five-point scale (1 denoting a very bad relationship, 5 an excellent one). The theorized relationships are as follows: the more free time students have, the more they drink to pass the time, and the worse their relationship with their family, the more they drink for comfort (the same caveat applies here as with the previous relationship). These are the relationships that will be explored below: A) the effects of alcohol use on the final grade; B) the effects of alcohol use on absences and the effects of absences on the grade; C) the effects of free time on alcohol use; D) the effects of family relations on alcohol use. Any further interesting relationships will be explored as warranted by the initial results (such as the effect of family relations, given a lot of free time, on alcohol use).

4.
Numerical and graphical exploration of relationships A through D.

A and B

The above set of graphs explores the relationships between alcohol use, absences, and the final grade. The results have been further divided by sex in the spirit of last week. A few noteworthy points can immediately be made. Firstly, there seems to be, overall, no statistically significant relationship between the number of absences and the final grade. This, if anything, is a troubling result for Portuguese teachers. Admittedly, among males the relationship approaches statistical significance. On the other hand, alcohol use would seem to predict both higher levels of absences and lower scores, although here too the difference between males and females is notable.

Since there is no theoretical reason for this division, it raises some questions about the data. As such, before delving into the numbers further, we need to examine the data more to see if these variations between sexes can be explained by abnormalities in the observations. Two observations immediately jump out: in the column where absences are on the y-axis, we can note two observations, both female, that could constitute outliers. To examine this further, we will carry out a regression analysis with absences as the explanatory variable for the final score, and another with alcohol use as the explanatory variable for absences. Both analyses will then be subjected to the residuals vs. leverage test from last week, which will help us see whether some of the data points have an unreasonably high power to pull the models’ predictions toward themselves.
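The code behind the two models and their diagnostic plots is not shown in the diary; a sketch of what it presumably looked like (the object names are my own):

```r
# Fit both models, then draw the residuals vs. leverage diagnostic
# (panel 5 of plot.lm) for each to spot high-leverage observations.
Reg_GradeAbs <- lm(G3 ~ absences, data = TheData)
Reg_AbsAlc   <- lm(absences ~ alc_use, data = TheData)
summary(Reg_GradeAbs)
summary(Reg_AbsAlc)
plot(Reg_GradeAbs, which = 5)  # Residuals vs. Leverage
plot(Reg_AbsAlc, which = 5)
```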

## 
## Call:
## lm(formula = G3 ~ absences, data = TheData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7235  -1.6055   0.3355   2.3355   6.8688 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.72350    0.21879  53.583   <2e-16 ***
## absences    -0.05897    0.03094  -1.906   0.0574 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.299 on 380 degrees of freedom
## Multiple R-squared:  0.009471,   Adjusted R-squared:  0.006865 
## F-statistic: 3.634 on 1 and 380 DF,  p-value: 0.05738

## 
## Call:
## lm(formula = absences ~ alc_use, data = TheData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.417 -3.442 -1.442  1.576 41.558 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2523     0.5918   3.806 0.000165 ***
## alc_use       1.1901     0.2779   4.282 2.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.343 on 380 degrees of freedom
## Multiple R-squared:  0.04603,    Adjusted R-squared:  0.04352 
## F-statistic: 18.33 on 1 and 380 DF,  p-value: 2.349e-05

It seems that the two data points have such high leverage as to bring their validity into question. Of course, in the absence of reasoned proof that they are invalid, they should be left in. In the interest of this exercise, I have nevertheless decided to apply the rule of thumb that observations with a Cook’s distance higher than 4/n (where n is the number of observations) can be removed. Let us see what the end result is after we apply this procedure to the data.
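The filtering step itself is hidden in the diary; a hedged sketch of how the 4/n rule could be applied (which models the distances were computed from, and the object names, are my assumptions):

```r
# Flag observations whose Cook's distance exceeds 4/n in either model,
# then keep only the unflagged rows.
n <- nrow(TheData)
cd_grade <- cooks.distance(lm(G3 ~ absences, data = TheData))
cd_abs   <- cooks.distance(lm(absences ~ alc_use, data = TheData))
TheData_2 <- TheData[cd_grade <= 4 / n & cd_abs <= 4 / n, ]
dim(TheData_2)
```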

## 
## Call:
## lm(formula = G3 ~ absences, data = TheData_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7846  -1.6377   0.2888   2.2888   6.9496 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.78457    0.22435  52.529   <2e-16 ***
## absences    -0.07342    0.03461  -2.121   0.0346 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.28 on 376 degrees of freedom
## Multiple R-squared:  0.01183,    Adjusted R-squared:  0.009198 
## F-statistic:   4.5 on 1 and 376 DF,  p-value: 0.03455
## 
## Call:
## lm(formula = absences ~ alc_use, data = TheData_2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.885 -3.395 -1.397  1.603 41.595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4130     0.5349   4.511 8.63e-06 ***
## alc_use       0.9921     0.2533   3.917 0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.791 on 376 degrees of freedom
## Multiple R-squared:  0.0392, Adjusted R-squared:  0.03664 
## F-statistic: 15.34 on 1 and 376 DF,  p-value: 0.0001066

The new dimensions:

## [1] 378  35

With the removal of just four observations with a Cook’s distance higher than 4/n, as shown by the new dimensions, we can see that absences now function as a statistically significant predictor of academic performance. I would argue that despite the absence of observation-specific reasons supporting the removal, the overall logical expectation that presence in class predicts performance, and the magnitude of change in the statistical significance of the results, warrant the removal of these values. As such, moving onward, this analysis relies on the modified dataset.

Finally, to test these results against just the effects of alcohol use on the final grade, and the effects of absences given high alcohol use, we will conduct two more regression analyses:

## 
## Call:
## lm(formula = G3 ~ alc_use, data = TheData_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4094  -1.8989   0.1011   2.1011   6.3459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.3884     0.3645  33.984  < 2e-16 ***
## alc_use      -0.4895     0.1726  -2.836  0.00482 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.265 on 376 degrees of freedom
## Multiple R-squared:  0.02094,    Adjusted R-squared:  0.01833 
## F-statistic: 8.041 on 1 and 376 DF,  p-value: 0.004819
## 
## Call:
## lm(formula = G3 ~ absences, data = filter(TheData_2, alc_use > 
##     3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8329  -0.7404   1.1671   1.8142   5.6294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.83286    0.93802  11.549 1.72e-13 ***
## absences    -0.04622    0.11495  -0.402     0.69    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.606 on 35 degrees of freedom
## Multiple R-squared:  0.004599,   Adjusted R-squared:  -0.02384 
## F-statistic: 0.1617 on 1 and 35 DF,  p-value: 0.69

Looking at the results of the regression analyses, we can see that despite all the work that went into removing the high-Cook’s-distance observations, alcohol use on its own is still a stronger and statistically more significant predictor of poorer academic performance than the number of absences. We also see that absences given high alcohol use provide nothing better in terms of predictive power than absences alone. As such, contrary to the originally proposed mechanism, while alcohol use is a statistically significant and strong predictor of absences (one step up the alcohol use scale corresponds to almost one full day of additional absences), absences themselves do not function as a strong predictor of poorer academic performance. In fact, absences only explain approximately half of the variation in the final grade that is explained by alcohol use. We can thus conclude that while alcohol use predicts poorer performance, it does not seem to do so through absences.

C and D

As expected, both negative family relations and free time are statistically significant predictors of alcohol use. We can examine these in more detail with linear regression, as has been done below. We can see that both variables are statistically significant predictors of alcohol use. As for the hypothesized impact of poor family relations given lots of free time, it does not have an effect larger than that of poor family relations alone. In fact, given ample free time, poor family relations seem to have a smaller effect, but this difference is not statistically significant:

## 
## Call:
## lm(formula = alc_use ~ freetime, data = TheData_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1822 -0.8364 -0.1822  0.5095  3.1636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.31755    0.16884   7.803 6.02e-14 ***
## freetime     0.17294    0.05015   3.449 0.000627 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9604 on 376 degrees of freedom
## Multiple R-squared:  0.03066,    Adjusted R-squared:  0.02808 
## F-statistic: 11.89 on 1 and 376 DF,  p-value: 0.0006273
## 
## Call:
## lm(formula = alc_use ~ famrel, data = TheData_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2930 -0.8469 -0.2576  0.4924  3.2778 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.43567    0.22189  10.977  < 2e-16 ***
## famrel      -0.14269    0.05497  -2.596  0.00981 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9668 on 376 degrees of freedom
## Multiple R-squared:  0.0176, Adjusted R-squared:  0.01499 
## F-statistic: 6.738 on 1 and 376 DF,  p-value: 0.009808
## 
## Call:
## lm(formula = alc_use ~ famrel + freetime, data = TheData_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3850 -0.7974 -0.2116  0.5067  3.0067 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.92585    0.25381   7.588  2.6e-13 ***
## famrel      -0.17339    0.05452  -3.180 0.001594 ** 
## freetime     0.19586    0.05007   3.912 0.000109 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.949 on 375 degrees of freedom
## Multiple R-squared:  0.05612,    Adjusted R-squared:  0.05108 
## F-statistic: 11.15 on 2 and 375 DF,  p-value: 1.982e-05
## 
## Call:
## lm(formula = alc_use ~ famrel, data = filter(TheData_2, freetime > 
##     3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20823 -1.00998 -0.07606  0.85786  2.99002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.34040    0.45699   5.121 9.64e-07 ***
## famrel      -0.06608    0.11041  -0.599     0.55    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.102 on 143 degrees of freedom
## Multiple R-squared:  0.002499,   Adjusted R-squared:  -0.004477 
## F-statistic: 0.3582 on 1 and 143 DF,  p-value: 0.5504

As such, we can conclude section 4 by summarizing that while alcohol use has a negative effect on academic performance, and poor family relations and free time increase alcohol consumption, all of these effects, while statistically significant, are modest if we look at R-squared: alcohol use only explains approximately 2% of the variance in the final grades, while poor family relations and free time, even taken together, only explain approximately 5% of the variance in alcohol consumption. As such, while we have some evidence of causal relationships, those relationships are not strong. We can additionally reject the hypothesis that the mechanism by which alcohol consumption affects grades is the increased number of absences.

5.
Logistic Regression of the above variables.

In the above analysis we have treated alcohol use either as an explanatory variable (A-B) or as a target variable (C-D) in a linear model. Here, alcohol use will be recoded as a binomial variable: individuals with an alcohol consumption score higher than 2 will be labeled “alcoholics.” As such, individuals with an alc_use of 2.5 or higher belong to the category “alcoholics,” while the rest do not. Modelling the other above variables within this framework requires logistic regression, which calculates the probability of an individual belonging to a category (here, alcoholics) based on the model inputs. A probability higher than 0.5 will indicate belonging to the group.

We will employ all the other variables used above, including absences, since it did have a statistically significant relationship with alcohol use. Consequently, we get the following logistic regression:

## 
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences + G3, 
##     family = "binomial", data = TheData_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6726  -0.8342  -0.6456   1.1785   2.0220  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.56725    0.73233  -0.775 0.438585    
## famrel      -0.31613    0.12903  -2.450 0.014280 *  
## freetime     0.41441    0.12521   3.310 0.000934 ***
## absences     0.07175    0.02371   3.027 0.002473 ** 
## G3          -0.06959    0.03593  -1.937 0.052781 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 455.91  on 377  degrees of freedom
## Residual deviance: 424.42  on 373  degrees of freedom
## AIC: 434.42
## 
## Number of Fisher Scoring iterations: 4

The Odds Ratios and Their Confidence Intervals

##                    OR     2.5 %    97.5 %
## (Intercept) 0.5670801 0.1325241 2.3671826
## famrel      0.7289627 0.5649509 0.9387214
## freetime    1.5134784 1.1886426 1.9440711
## absences    1.0743848 1.0268123 1.1277893
## G3          0.9327736 0.8689267 1.0008341
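
The odds ratios and intervals above can be obtained by exponentiating the model coefficients and their confidence intervals (a sketch, assuming the fitted model object is named model):

```r
# exponentiate the coefficients to get odds ratios
OR <- exp(coef(model))

# exponentiate the (profile-likelihood) confidence intervals
CI <- exp(confint(model))

cbind(OR, CI)
```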

In the above summary we can see that the variables used have a wide range of statistical significance. As the commonly accepted cut-off for statistical significance is a p-value below 0.05, alongside absolute z-values higher than 2 and 95% confidence intervals for the odds ratio that do not include 1, we can conclude that the variable G3, the final grade, is not statistically significant in our model. As such, we can drop it going forward. (A confidence interval crossing 1 means that the 95% confidence interval for the odds ratio contains 1, indicating no relationship between the predictor and the target variable.)

The odds ratios support our initial hypothesis. Since odds ratios higher than 1 indicate that the variable is positively correlated with the observation/individual belonging to the group (in this case alcoholics), both free time and absences positively predict belonging to the alcoholics group.

Since higher family relations negatively predict belonging to alcoholics, we can conclude that the hypothesized positive impact of bad family relations holds.

Nevertheless, as stated above, the impact of these variables is minor, with odds ratios falling close to 1.

(As final grade is not statistically significant, it has been ignored)

6.
The below numerical and graphical explorations detail the accuracy of the model without the variable G3. While the plot would seem to indicate a rather random sorting of predictions, a closer examination, carried out by tabulating the predictions against the data, showcases a more nuanced model. It is rather clear that the model over-predicts non-alcoholics, and when it does predict an alcoholic, there is only (approximately) a 50/50 chance of that prediction being right. But since the majority of cases are non-alcoholics, the model’s training error is “only” 0.29, meaning that 29% of the predictions are incorrect. This is better than mere random guessing, or flipping a coin, especially since the alcoholics and non-alcoholics are not split 50/50. Nevertheless, we can see both from the graph and the confusion matrix that the model misses many, many cases where the individual does belong to the group “alcoholics.” As such, it is not a good model.

## 
## Call:
## glm(formula = alcoholics ~ famrel + freetime + absences, family = "binomial", 
##     data = TheData_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7039  -0.8148  -0.6661   1.2152   1.9476  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.33999    0.61829  -2.167 0.030215 *  
## famrel      -0.32602    0.12889  -2.529 0.011424 *  
## freetime     0.41713    0.12437   3.354 0.000797 ***
## absences     0.07582    0.02361   3.211 0.001323 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 455.91  on 377  degrees of freedom
## Residual deviance: 428.17  on 374  degrees of freedom
## AIC: 436.17
## 
## Number of Fisher Scoring iterations: 4

##           prediction
## alcoholics FALSE TRUE
##      FALSE   255   13
##      TRUE     98   12
## [1] 0.2936508
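
The confusion matrix and training error above can be produced along these lines (a sketch; model2 stands for the refit model without G3, and the object names are my own):

```r
# predicted probability of belonging to the "alcoholics" group
probability <- predict(model2, type = "response")

# classify: a probability above 0.5 counts as a predicted alcoholic
prediction <- probability > 0.5

# cross-tabulate actual classes against predictions
table(alcoholics = TheData_2$alcoholics, prediction = prediction)

# training error: proportion of incorrect predictions
mean(prediction != TheData_2$alcoholics)
```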

BONUS

By using ten-fold cross-validation, we can repeatedly train the model on nine-tenths of “TheData” and then check its accuracy (defined by the ratio of incorrect guesses, as above) against the remaining one-tenth, rotating through all ten folds. This is done below. The error of 0.3 indicates that the model performs on unseen data roughly as well as it does on the data it was trained on. It is worse than the one introduced in DataCamp.
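
The cross-validation below was run with boot::cv.glm along these lines (a sketch; model2 is assumed to be the model without G3, and the cost function measures the proportion of incorrect predictions):

```r
library(boot)

# cost: proportion of misclassified observations at the 0.5 threshold
cost <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation of the model without G3
cv <- cv.glm(data = TheData_2, cost = cost, glmfit = model2, K = 10)
cv$delta[1]  # cross-validated proportion of incorrect predictions
```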

## [1] 0.3042328

I was able to find a better model after playing around with the above variables in combination with sex, failures and goout. It has an error rate of 0.24 in ten-fold cross-validation:

## [1] 0.2407407

THE END!


Introduction to Open Data Science - Week 4, Linear Discriminant Analysis and K-Means

2. The “Boston”-dataset

The dataset, “Boston,” used in this analysis can be loaded with the “MASS”-package. As such, it can be seen as a training dataset of sorts. It contains 14 variables with a (potential) connection to housing values in the suburbs of Boston. These variables are:

Variable Explanation
“crim” per capita crime rate by town.
“zn” proportion of residential land zoned for lots over 25,000 sq.ft.
“indus” proportion of non-retail business acres per town.
“chas” Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
“nox” nitrogen oxides concentration (parts per 10 million).
“rm” average number of rooms per dwelling.
“age” proportion of owner-occupied units built prior to 1940.
“dis” weighted mean of distances to five Boston employment centres.
“rad” index of accessibility to radial highways.
“tax” full-value property-tax rate per $10,000.
“ptratio” pupil-teacher ratio by town.
“black” 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
“lstat” lower status of the population (percent).
“medv” median value of owner-occupied homes in $1000s.
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Each variable has 506 observations/points of data, and the below computational check would seem to indicate that the data contains no empty data points and that each of the 14x506=7084 values is numeric or an integer. Some observations are ratios, some percentages, and at least one (chas) is a dummy variable coded 0/1.

# Count the empty (NA) data points across the whole data frame.
# (The original loops iterated over row indices, not values, so they
# could never find an NA; the vectorized checks below do the job.)
Empty <- sum(is.na(Boston))

# Count the columns that hold numeric/integer data
Numer <- sum(sapply(Boston, is.numeric))

3. The Graphical Overview of the Data.

Below, the reader can find the simple bar graphs of each variable, as well as a summary of each examined variable:

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

As the above general overview indicates, the data takes various values in various ranges, as one would expect from a dataset containing many different measures. Commenting on all the distributions seems pointless at first sight, but most of the graphs indicate something interesting.

To begin with, the age-graph indicates an aging city. More importantly, its automatic scale seems off. Either there is a high number of areas in the city where the proportion of buildings built before 1940 is close to 100, or the dataset has a typo. Or, as a final thought that seems the most likely: many of the properties surveyed for this dataset come from the same Boston town/area and hence share exactly the same values for some area-specific variables.

The black-graph appears empty. Yet again, a closer examination (carried out below) shows that the data takes some interesting values. I do not have the knowledge to say what the measure “1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town” should indicate, but approximately 120 of the properties seem to get a value close to 397, while other values occur only once. In fact, the summary statistic presented above shows that this value is probably 396.90. The variable is also curious in that most of its observations fall within the range of 370 to 400, while its smallest observation is 0.32. It might be that the small observation has not been multiplied by 1000 as per the formula, since the result would then come close to the expected range. Yet again, this can be a sign of a typo, or something else going on. And as above, the repetition of one value might be explained by many of the properties surveyed for this dataset coming from the same Boston town/area and hence sharing exactly the same values for some area-specific variables.

The crim-graph also looks empty, but again, the more detailed look below shows that most values range from 0 to 1, as one would expect from a per-capita rate. The fact that the general graph above has a range of 0 to 75 would indicate a typo or some placeholder value. Perhaps certain observations have been multiplied by a hundred to give a percentage, or the person recording the observation has forgotten a decimal point. This would seem to be the case based on the summary statistic, since the max values are far above the median and mean.

The dis-graph looks empty as well, but the below closer look shows the granular level of observations. With no aggregation, the single lines disappear from the graph when it is extended to contain all values. The summary statistic indicates nothing out of the ordinary.

The final empty-appearing graph, indus, seems to suffer from the same issue as the black-graph. As one can see below, most value counts range between 0 and 10, but around the 17-mark there seems to be one value with a high count, approximately 120 observations. The same issue can also be observed in the tax-, rad-, and ptratio-graphs, although no detailed look is carried out below. This is further evidence that many of the properties surveyed for this dataset might come from the same Boston town/area and hence share exactly the same values for some area-specific variables.

Finally, the zn-graph looks odd as well, but I would argue that this is just the result of strict zoning laws prohibiting large properties in most areas (observe the large count at the value 0).

As for the relationships between the variables, the below matrix shows the correlations of each variable paired with each of the others. Of note is the fact that the matrix seemingly indicates that each variable has some statistically significant relationship with each of the other variables. The only exception is the chas-variable, which is also the only dummy variable. An interesting question in this regard is why the chas-variable is the only one without statistically significant correlations with most other variables. The first answer is simple: because whether a tract bounds the river has no statistically significant impact on many of the other variables. The second option comes down to the inner workings of R: it might be that the cor.mtest-function used here to map p-values does not work well for dummy variables. No mention of this possibility is given by the ?cor.mtest-command.
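
The correlation matrix and its p-values were produced along these lines (a sketch using the corrplot package; the object names are my own):

```r
library(MASS)      # provides the Boston data
library(corrplot)  # provides corrplot() and cor.mtest()

# correlation matrix of all 14 variables
cor_matrix <- cor(Boston)

# p-values for each pairwise correlation
p_values <- cor.mtest(Boston)

# visualize, marking correlations not significant at the 0.05 level
corrplot(cor_matrix, p.mat = p_values$p, sig.level = 0.05, type = "upper")
```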

On the other hand, it is perhaps not surprising that variables that are expected to be significant predictors of housing prices also have statistically significant correlations with one another. Out of these correlations, a few should be highlighted in preparation for the coming phases. The variable “crim” (per capita crime rate by town) seems to have a strong, statistically significant positive correlation with high property-tax properties, as well as with properties with easier access to radial highways. Property crime is a plausible explanatory factor for these correlations: high tax- and rad-values indicate high-value targets (the former) and/or easy getaway and access options (the latter).

Higher levels of industry (indus), house age (age), air pollution (nox), pupil-to-teacher ratio (ptratio), and the population’s lower status all have a weaker positive correlation with the crime rate. This perhaps indicates a second category of neighborhoods compared to the above: older, impoverished industrial areas with less access to good education.

Both higher median value and longer distance from employment centers correlate weakly and negatively with a higher crime rate. I have a hard time explaining this. Perhaps it is due to the existence of middle-class suburbs, which are unattractive to property theft due to their distance from a poor city center? This conclusion is perhaps supported by the strong negative correlation between the dis-variable on one hand and the indus-, nox-, and age-variables on the other, which would seem to indicate that the (employment) centers of the city are older industrial neighborhoods. All of this is of course speculative in the absence of clearer information.

Finally, the black-variable seems to be weakly and negatively correlated with a higher crime rate, but as I do not understand the calculations behind the variable, it is rather hard to interpret the (potential) meaning of the correlation. As such, I will drop it going forward.

4. Standardization and Categorization

Below the reader can find a summary of the standardized Boston variables. All of them can be seen to share a mean of zero, which is by definition a feature of a standardized variable. They are also all on the same scale now, which means that they can be compared to one another more easily, although that would not be immediately clear from the data, since the value distributions still retain their curious aspects: for example, with the variable “black,” the minimum still lies far to the left of the rest of the data. Additionally, the standardized binomial variable “chas” has arguably become nonsensical: the old value of 0 has been replaced by -0.2723 and the old value of 1 by 3.6648.

It should also be noted that none of the variables can be fully standardized into standard normal distributions, since they do not adhere to a normal distribution to begin with. This is, at least in part, probably due to the (theorized) over-representation of one neighborhood in the dataset.
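
The standardization itself was done with the scale() function, which subtracts each column’s mean and divides by its standard deviation (a sketch; the object name boston_scaled is my own):

```r
library(MASS)  # provides the Boston data

# center and scale every column: (x - mean(x)) / sd(x)
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
```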

##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

The reader can also note that in this second set, the crim-variable has been replaced by the categorical Crime-variable, as per the instructions, and the chas-variable has been returned to its original binomial state. Even further down, the reader can finally find the test set with the Crime-variable removed, after the correct answers have been saved.
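
The categorization itself can be sketched as follows: the standardized crime rate is cut at its quantiles (printed below) into four classes, and the continuous variable is dropped (object and label names are my own, matching the output further down):

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))

# quantile breaks for the standardized crime rate
bins <- quantile(boston_scaled$crim)

# cut into four classes between the quantile breaks
Crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("Lowest", "Lower", "Higher", "Highest"))

# replace the continuous crim with the categorical Crime
boston_scaled <- boston_scaled[, names(boston_scaled) != "crim"]
boston_scaled$Crime <- Crime
```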

##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
##        zn               indus              nox                rm         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-1.4644   Min.   :-3.8764  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.9121   1st Qu.:-0.5681  
##  Median :-0.48724   Median :-0.2109   Median :-0.1441   Median :-0.1084  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.: 0.5981   3rd Qu.: 0.4823  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 2.7296   Max.   : 3.5515  
##       age               dis               rad               tax         
##  Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127  
##  1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668  
##  Median : 0.3171   Median :-0.2790   Median :-0.5225   Median :-0.4642  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294  
##  Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964  
##     ptratio            black             lstat              medv        
##  Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.2746   Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865  
##      Crime      Boston.chas     
##  Lowest :127   Min.   :0.00000  
##  Lower  :126   1st Qu.:0.00000  
##  Higher :126   Median :0.00000  
##  Highest:127   Mean   :0.06917  
##                3rd Qu.:0.00000  
##                Max.   :1.00000
##        zn               indus              nox                 rm         
##  Min.   :-0.48724   Min.   :-1.4470   Min.   :-1.35225   Min.   :-3.8764  
##  1st Qu.:-0.48724   1st Qu.:-0.7397   1st Qu.:-0.90350   1st Qu.:-0.6143  
##  Median :-0.48724   Median :-0.0797   Median :-0.14407   Median :-0.2044  
##  Mean   :-0.03724   Mean   : 0.1453   Mean   : 0.04408   Mean   :-0.1773  
##  3rd Qu.:-0.48724   3rd Qu.: 1.0150   3rd Qu.: 0.64339   3rd Qu.: 0.2727  
##  Max.   : 3.58609   Max.   : 2.1155   Max.   : 2.72965   Max.   : 2.3318  
##       age               dis                rad               tax          
##  Min.   :-2.3331   Min.   :-1.24464   Min.   :-0.9819   Min.   :-1.30676  
##  1st Qu.:-1.0711   1st Qu.:-0.88200   1st Qu.:-0.6373   1st Qu.:-0.77572  
##  Median : 0.2407   Median :-0.26385   Median :-0.5225   Median :-0.34851  
##  Mean   :-0.0695   Mean   :-0.07914   Mean   : 0.0191   Mean   : 0.03053  
##  3rd Qu.: 0.9476   3rd Qu.: 0.50107   3rd Qu.: 1.6596   3rd Qu.: 1.52941  
##  Max.   : 1.1164   Max.   : 3.28405   Max.   : 1.6596   Max.   : 1.52941  
##     ptratio            black              lstat               medv        
##  Min.   :-2.5199   Min.   :-3.90333   Min.   :-1.32936   Min.   :-1.9063  
##  1st Qu.:-0.3028   1st Qu.: 0.15651   1st Qu.:-0.73561   1st Qu.:-0.6858  
##  Median : 0.2977   Median : 0.37402   Median :-0.03754   Median :-0.2047  
##  Mean   : 0.1116   Mean   :-0.05187   Mean   : 0.16201   Mean   :-0.1241  
##  3rd Qu.: 0.8058   3rd Qu.: 0.43155   3rd Qu.: 0.76451   3rd Qu.: 0.2166  
##  Max.   : 1.2677   Max.   : 0.44062   Max.   : 3.09715   Max.   : 2.9865  
##   Boston.chas     
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.07843  
##  3rd Qu.:0.00000  
##  Max.   :1.00000

5. Linear Discriminant Analysis (LDA)

Despite the fact that none of the variables adhere to the assumption of normal distribution required by LDA, nor is the chas-variable continuous as is usually expected, below the reader can find the required LDA (bi)plot. It contains the categorical crime rate as the target variable and all of the remaining variables as predictor variables (even the black-variable, despite what I said earlier about not using it). Observing both the biplot and the LDA output, we can see that LD1 explains approximately 95 percent of the between-group variance, while LD2 explains approximately 4 percent and LD3 approximately 1 percent.
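
The model itself was fit with MASS::lda() (a sketch; TrainSet is assumed to be the 80% training split, with the categorical Crime variable as target):

```r
library(MASS)

# fit the LDA with Crime as target and all remaining variables as predictors
lda.fit <- lda(Crime ~ ., data = TrainSet)

# basic plot of the observations in the first two linear discriminants
# (the arrow overlay of the full biplot requires an extra helper function)
plot(lda.fit, dimen = 2, col = as.numeric(TrainSet$Crime))
```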

## Call:
## lda(Crime ~ ., data = TrainSet)
## 
## Prior probabilities of groups:
##    Lowest     Lower    Higher   Highest 
## 0.2475248 0.2549505 0.2500000 0.2475248 
## 
## Group means:
##                 zn       indus        nox         rm        age        dis
## Lowest   0.9975950 -0.96831442 -0.9016927  0.4664355 -0.8962317  0.9464894
## Lower   -0.0990564 -0.27533781 -0.5519455 -0.1181898 -0.2676529  0.3383924
## Higher  -0.3666748  0.08572049  0.3456871  0.2335388  0.4101022 -0.3573452
## Highest -0.4872402  1.01715195  1.0760911 -0.3997261  0.8285983 -0.8533932
##                rad        tax    ptratio      black      lstat        medv
## Lowest  -0.6844182 -0.7461689 -0.4815519  0.3738498 -0.7845841  0.53366730
## Lower   -0.5369796 -0.4434751 -0.0216152  0.3185990 -0.1278610 -0.02119605
## Higher  -0.4155974 -0.3386129 -0.3864863  0.1003974 -0.0902456  0.28795823
## Highest  1.6377820  1.5138081  0.7803736 -0.7504971  0.8421769 -0.67606132
##         Boston.chas
## Lowest    0.0300000
## Lower     0.0776699
## Higher    0.1089109
## Highest   0.0500000
## 
## Coefficients of linear discriminants:
##                     LD1         LD2         LD3
## zn           0.07249515  0.68407956 -0.78481634
## indus        0.03837037 -0.36601100  0.42523741
## nox          0.39228148 -0.68871879 -1.37968157
## rm          -0.12585899 -0.16562528 -0.13259686
## age          0.25295819 -0.36972186 -0.09294929
## dis         -0.05219315 -0.19708128  0.08940576
## rad          3.39276845  0.74940773 -0.10015293
## tax          0.01091098  0.32379071  0.44562444
## ptratio      0.08522263 -0.00456574 -0.11142726
## black       -0.13769050  0.06220555  0.19116497
## lstat        0.15011407 -0.19580643  0.39596927
## medv         0.15433151 -0.35843871 -0.19300073
## Boston.chas -0.40691713 -0.08872385  0.50525679
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9501 0.0373 0.0126

6. LDA for Prediction

The table below showcases the cross-tabulated results of the predictions against the actual categories. We can see that the model rather accurately predicts the group membership of properties in the higher and highest crime rate areas, while it struggles a bit more in the lower and lowest categories. Indeed, the model seems to slightly over-predict higher crime rates, especially when provided with data on a property in a lower/lowest crime rate area. Nevertheless, it outperforms simple guessing, under which a property would have an almost equal 25% chance of belonging to any of these categories, as indicated by the prior probabilities in the previous section’s model output. As such, a simple random division of the properties into four equal-sized groups would result, on average, in three incorrect predictions per correct prediction. Such odds are much worse than the odds of the model correctly predicting a property belonging to a lowest crime rate area.
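
The cross-tabulation below was produced along these lines (a sketch; TestSet is assumed to be the 20% test split with Crime removed, and correct_classes the classes saved before removal):

```r
# predict the crime classes of the test set observations
lda.pred <- predict(lda.fit, newdata = TestSet)

# cross-tabulate the saved correct classes against the predictions
table(correct = correct_classes, predicted = lda.pred$class)
```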

##          predicted
## correct   Lowest Lower Higher Highest
##   Lowest      13    11      3       0
##   Lower        5    14      4       0
##   Higher       0    10     13       2
##   Highest      0     0      0      27

7. K-Means Analysis

The below two graphs showcase the results of the final K-Means analysis. The first graph details the change in the total within-cluster sum of squares (WCSS) as we increase the number of clusters from 1 to 14. The aim is to use the graph to find the optimal number of clusters. As it is clear that a more granular level will lead to a smaller WCSS without necessarily being a better grouping device (consider, for example, that the smallest WCSS comes from having only one observation in each “cluster,” meaning that no clustering has been done), we need to find a point where the WCSS drops drastically, indicating a number of clusters that is significantly more precise than a smaller number, but not significantly less precise than a larger number. The first graph indicates that this point is two (2) clusters.

As for the pairs plot produced by clustering pairs of variables into two clusters, we will only discuss the top row/first column of the graphs, which relate to the crime rate. This is done to limit the discussion to the relevant aspects and not cover each of the 182 panels. What we need to keep in mind is that a K-Means analysis that clusters into two groups attempts to find two sets of observations for which the total (in this case Euclidean) distances to the group mean are the smallest. If we had a single group that shares many of its observation values, we would expect such a group to repeat itself in each graph. And, indeed, we see most of the crime-graphs maintain a very similar, flat/narrow red-group structure throughout the groupings. To me, this is further evidence that the roughly 120 uncommonly consistently-valued observations that section 2 identified in multiple variables come from a single group of properties in the same area. Perhaps the data showcase something else as well, but hopefully this will suffice. This is already a long text.
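
The distance summary below and the K-Means results were produced along these lines (a sketch; the full Boston data is re-standardized, and the set.seed() call is my own addition for reproducibility):

```r
library(MASS)

# re-standardize the full Boston data
boston_scaled <- as.data.frame(scale(Boston))

# Euclidean distances between the standardized observations
dist_eu <- dist(boston_scaled)
summary(dist_eu)

# total within-cluster sum of squares (WCSS) for 1 to 14 clusters
set.seed(123)
twcss <- sapply(1:14, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:14, twcss, type = "b", xlab = "clusters", ylab = "total WCSS")

# final model with the two clusters suggested by the elbow in the plot
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
```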

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.2663  4.6116  4.7275  5.9572 13.8843